4.8 Comparing Algorithms

[...] do not take the variance of the training sets into account, which can be an issue especially if training sets are small and learning algorithms are susceptible to perturbations in the training sets.

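"(So far) the variance of the training sets has not been taken into account, which becomes a particular problem when the training sets are small and the learning algorithm is sensitive to perturbations in them."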
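To make the training-set variance concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration and does not come from the paper: the synthetic 1-D data (two Gaussian classes), the nearest-centroid threshold rule, and the helper names `sample`, `fit_threshold`, `accuracy`, and `accuracy_spread`. The same algorithm is refit on many independently drawn training sets and scored on one fixed test set, so any spread in accuracy is caused by the training sets alone.

```python
import random
import statistics


def sample(rng, n):
    """Draw n labelled points: class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(1.5 * label, 1.0), label))
    return data


def fit_threshold(train):
    """Nearest-centroid rule in 1-D: cut halfway between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    if not xs0 or not xs1:
        # Degenerate training set with a single class: use a fixed cut.
        return 0.75
    return (statistics.mean(xs0) + statistics.mean(xs1)) / 2


def accuracy(threshold, test):
    """Fraction of test points where 'x > threshold' matches the label."""
    return sum((x > threshold) == bool(y) for x, y in test) / len(test)


def accuracy_spread(n_train, n_repeats=200, seed=0):
    """Mean and standard deviation of test accuracy over resampled training sets."""
    rng = random.Random(seed)
    test = sample(rng, 2000)  # one fixed test set for all repeats
    accs = [
        accuracy(fit_threshold(sample(rng, n_train)), test)
        for _ in range(n_repeats)
    ]
    return statistics.mean(accs), statistics.stdev(accs)


if __name__ == "__main__":
    for n in (10, 500):
        mean_acc, sd_acc = accuracy_spread(n)
        print(f"n_train={n}: mean accuracy={mean_acc:.3f}, sd={sd_acc:.4f}")
```

With small training sets the standard deviation of the accuracy is noticeably larger, and a test that compares two models fit to a single training set cannot see that source of variation.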

However, if we consider the comparison between sets of models where each set has been fit to different training sets, we conceptually shift from a model comparison to an algorithm comparison task.

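"However, once we compare sets of models where each set was trained on a different training set, the task conceptually moves from model comparison to algorithm comparison."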

In this case, we want to find out how different algorithms perform on datasets from a similar problem domain.

"That is, the goal is to see how well different algorithms perform on datasets drawn from a similar problem domain."

Conclusions of Dietterich 1998 (still on my to-read pile), "Approximate statistical tests for comparing supervised classification learning algorithms":

1. McNemar's test

Low false positive rate

Fast; the models only need to be fit once

2. Test for the difference of two proportions

High false positive rate (it tends to report a difference even when there is none)

Computationally cheap

3. Resampled paired t-test (see 4.9 Resampled Paired t-Test)

High false positive rate

Computationally very expensive

4. k-fold cross-validated t-test

Elevated false positive rate

Requires refitting on the training sets; k times the computation of McNemar's test

5. 5x2cv paired t-test

5x2-Fold Cross-Validation method

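Low false positive rate (comparable to McNemar's test)

Slightly more powerful than McNemar's test. Recommended when computational efficiency is not a concern (about 10 times the computation of McNemar's test)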
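As a concrete reference for item 1, here is a minimal sketch of McNemar's test using only the standard library. The function name `mcnemar_test` and its list-based interface are my own assumptions for illustration. It uses the common continuity-corrected chi-squared form, and the p-value relies on the fact that the chi-squared survival function with 1 degree of freedom equals erfc(sqrt(x / 2)).

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Return (chi2, p_value) for McNemar's test with continuity correction.

    y_true, pred_a, pred_b are equal-length sequences of labels; pred_a and
    pred_b are the predictions of the two models on the same test set.
    """
    # b: model A correct, model B wrong; c: model A wrong, model B correct.
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    if b + c == 0:
        # The two models never disagree in correctness: no evidence of a difference.
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Toy data: 15 cases only A gets right, 5 only B gets right, 30 both get right.
    y_true = [1] * 50
    pred_a = [1] * 15 + [0] * 5 + [1] * 30
    pred_b = [0] * 15 + [1] * 5 + [1] * 30
    chi2, p = mcnemar_test(y_true, pred_a, pred_b)
    print(f"chi2={chi2:.3f}, p={p:.4f}")
```

Only the discordant counts b and c enter the statistic, which is why a single fit of each model and one shared test set are enough. When b + c is small, an exact binomial version of the test is usually preferred over this chi-squared approximation.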

McNemar’s test is a good choice if the datasets are relatively large and/or the model fitting can only be conducted once. If the repeated fitting of the models is possible, the 5x2cv test is a good choice as it also considers the effect of varying or resampled training sets on the model fitting.

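"McNemar's test is a good choice when the dataset is relatively large and/or the models can only be fit once."

"If the models can be fit repeatedly, the 5x2cv test is a good choice,"

"because it also accounts for the effect of varying or resampled training sets on the model fitting."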
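For reference, a minimal sketch of the 5x2cv paired t statistic as defined by Dietterich. The function name `paired_5x2cv_t` and its input format are assumptions for illustration: the caller is expected to have already run 5 repetitions of 2-fold cross-validation and to pass, for each repetition, the two per-fold differences in score between algorithm A and algorithm B. Under the null hypothesis the statistic approximately follows a t distribution with 5 degrees of freedom, so the two-sided critical value at the 0.05 level is about 2.571.

```python
import math

# Two-sided critical value of the t distribution with 5 degrees of freedom, alpha = 0.05.
T_CRIT_DF5_ALPHA_05 = 2.571


def paired_5x2cv_t(diffs):
    """Compute the 5x2cv paired t statistic.

    diffs: five (d1, d2) pairs, where d1 and d2 are the differences in score
    (algorithm A minus algorithm B) on the two folds of one 2-fold repetition.
    """
    if len(diffs) != 5:
        raise ValueError("expected exactly 5 repetitions of 2-fold CV")
    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2
        # Variance estimate for this repetition.
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    denom = math.sqrt(sum(variances) / 5)
    if denom == 0:
        raise ValueError("all per-repetition variances are zero")
    # The numerator is the difference from the first fold of the first repetition.
    return diffs[0][0] / denom


if __name__ == "__main__":
    # Hypothetical accuracy differences, for illustration only.
    example = [(0.02, 0.04), (0.01, 0.03), (0.03, 0.02), (0.00, 0.04), (0.02, 0.01)]
    t = paired_5x2cv_t(example)
    print(f"t={t:.3f}, significant at 0.05: {abs(t) > T_CRIT_DF5_ALPHA_05}")
```

Each of the 10 differences requires refitting both algorithms on a different half of the data, which is where the roughly tenfold cost relative to McNemar's test comes from, and also how the effect of varying training sets enters the test.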